Vehicle Attributes and Effects on MPG
The purpose of this study was to build a linear regression model from vehicle data in the 1985 Ward’s Automotive Yearbook, in hopes of uncovering relationships between MPG (miles per gallon) and other vehicle attributes.
By modeling city and highway MPG separately, I hoped to uncover differences in how the two rates are affected by the vehicle attributes in the data.
The dataset contains 205 observations, 6 of which were removed due to missing values, leaving 199 observations. Additionally, the dataset originally contained 26 variables, but I chose to study only 15 of these, dropping variables with many missing values or little relevance to the research question.
| highway | city | fueltype | aspiration | wheelbase | length | width | height | curbweight | enginesize | bore | stroke | compressionratio | horsepower | peakrpm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 21 | gas | std | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 |
| 27 | 21 | gas | std | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 |
| 26 | 19 | gas | std | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 |
| 30 | 24 | gas | std | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 |
| 22 | 18 | gas | std | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 |
| 25 | 19 | gas | std | 99.8 | 177.3 | 66.3 | 53.1 | 2507 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 |
| 25 | 19 | gas | std | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 |
| 25 | 19 | gas | std | 105.8 | 192.7 | 71.4 | 55.7 | 2954 | 136 | 3.19 | 3.40 | 8.5 | 110 | 5500 |
| 20 | 17 | gas | turbo | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 3.13 | 3.40 | 8.3 | 140 | 5500 |
| 22 | 16 | gas | turbo | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160 | 5500 |
The following explanatory variables were the focus of our analysis:
Fuel Type: gas or diesel
Aspiration: standard (std) or turbo
Wheel Base: the horizontal distance (in.) between the centers of the front and rear wheels
Length: length (in.) of vehicle
Width: width (in.) of vehicle
Height: height (in.) of vehicle
Curb-weight: the published weight (lbs.) of a vehicle with a full tank of fuel and all fluids filled
Engine-size: the volume (cubic in.) of fuel and air that can be pushed through a car’s cylinders
Bore: diameter (in.) of engine’s cylinder
Stroke: distance (in.) the piston travels inside the engine’s cylinder
Compression-ratio: ratio of the cylinder’s largest volume (piston at the bottom of its stroke) to its smallest volume (piston at the top)
Horsepower: the power an engine produces (1 horsepower = 550 ft-lbs per second)
Peak-RPM: the max speed an engine can spin (revolutions per minute)
14 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 44.2221836809
fueltype .
aspiration .
wheelbase .
length -0.0127124518
width 0.2664455176
height .
curbweight -0.0116197033
enginesize .
bore .
stroke 0.0989608427
compressionratio 0.5321475744
horsepower -0.0034414693
peakrpm -0.0009320573
14 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 4.015802e+00
fueltype .
aspiration .
wheelbase .
length -5.653066e-04
width .
height .
curbweight -2.747911e-04
enginesize .
bore .
stroke .
compressionratio 1.994827e-02
horsepower -1.869499e-03
peakrpm -5.245971e-06
Anderson-Darling normality test
data: residuals1
A = 0.7873, p-value = 0.04019
Anderson-Darling normality test
data: residuals
A = 2.107, p-value = 2.214e-05
---
title: "Final Project"
author: "Jesse Devitt"
date: "2023-11-30"
output:
flexdashboard::flex_dashboard:
theme:
version: 4
bootswatch: cosmo
primary: "blue"
orientation: columns
vertical_layout: fill
source_code: embed
---
<style>
.chart-title { /* chart_title */
font-size: 20px;
}
body{ /* Normal */
font-size: 18px;
}
</style>
```{r setup, include=FALSE}
library(flexdashboard)
library(shiny)
library(shinydashboard)
```
Introduction
===
<head>
<base target = "_blank">
</head>
<font size=5>
**Vehicle Attributes and Effects on MPG**
</font>
Column {data-width=650}
-----------------------------------------------------------------------
### Motivation
The purpose of this study was to build a linear regression model from vehicle data in the 1985 Ward's Automotive Yearbook, in hopes of uncovering relationships between MPG (miles per gallon) and other vehicle attributes.
By modeling city and highway MPG separately, I hoped to uncover differences in how the two rates are affected by the vehicle attributes in the data.
The dataset contains 205 observations, 6 of which were removed due to missing values, leaving 199 observations.
Additionally, the dataset originally contained 26 variables, but I chose to study only 15 of these, dropping variables with many missing values or little relevance to the research question.
```{r}
knitr::opts_chunk$set(echo = TRUE)
library(pacman)
library(tidyverse)
library(plotly)
library(corrplot)
library(RColorBrewer)
library(stats)
# The raw file marks missing values with "?"; read them in as NA so the numeric columns parse as numbers
vehicles <- read.csv("~/Library/Mobile Documents/com~apple~CloudDocs/MTH 369/automobile/imports-85.data", header=FALSE, na.strings = "?")
# Drop the 11 columns that are not part of this study
vehicles <- vehicles %>% select(-c(V1, V2, V3, V6, V7, V8, V9, V15, V16, V18, V26))
names(vehicles) <- c("fueltype", "aspiration", "wheelbase", "length", "width", "height", "curbweight", "enginesize", "bore", "stroke", "compressionratio", "horsepower", "peakrpm", "city", "highway")
# Store the measurement columns as numeric (some are read in as integer)
numeric_cols <- c("city", "highway", "curbweight", "enginesize", "bore", "stroke", "compressionratio", "horsepower", "peakrpm")
vehicles[numeric_cols] <- lapply(vehicles[numeric_cols], as.numeric)
vehicles$fueltype <- as.factor(vehicles$fueltype)
vehicles$aspiration <- as.factor(vehicles$aspiration)
vehicles <- vehicles[complete.cases(vehicles),]
vehicles <- vehicles[, c("city", names(vehicles)[-which(names(vehicles) == "city")])]
vehicles <- vehicles[, c("highway", names(vehicles)[-which(names(vehicles) == "highway")])]
knitr::kable(vehicles[1:10,])
```
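The cleaning step can be illustrated on a tiny stand-in for the raw file. This is a minimal sketch rather than the project's data: the three inline rows and the choice of columns are made up for illustration, and it assumes only that missing values are marked with `?`, as in the raw file.

```r
# Hypothetical three-row stand-in for the raw file; "?" marks a missing value
raw_text <- "3.47,2.68,111
?,?,101
3.19,3.40,?"

toy <- read.csv(text = raw_text, header = FALSE, na.strings = "?",
                col.names = c("bore", "stroke", "horsepower"))

# Count missing values per column before dropping anything
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only rows with no missing values, as complete.cases() does above
toy_complete <- toy[complete.cases(toy), ]
rows_removed <- nrow(toy) - nrow(toy_complete)
print(rows_removed)
```

Checking `missing_per_column` before filtering shows which variables are responsible for the dropped rows.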
Column {data-width=350}
-----------------------------------------------------------------------
### Variable Index
The following explanatory variables were the focus of our analysis:
- Fuel Type: gas or diesel
- Aspiration: standard (std) or turbo
- Wheel Base: the horizontal distance (in.) between the centers of the front and rear wheels
- Length: length (in.) of vehicle
- Width: width (in.) of vehicle
- Height: height (in.) of vehicle
- Curb-weight: the published weight (lbs.) of a vehicle with a full tank of fuel and all fluids filled
- Engine-size: the volume (cubic in.) of fuel and air that can be pushed through a car's cylinders
- Bore: diameter (in.) of engine's cylinder
- Stroke: distance (in.) the piston travels inside the engine's cylinder
- Compression-ratio: ratio of the cylinder's largest volume (piston at the bottom of its stroke) to its smallest volume (piston at the top)
- Horsepower: the power an engine produces (1 horsepower = 550 ft-lbs per second)
- Peak-RPM: the max speed an engine can spin (revolutions per minute)
EDA
===
Column {.tabset data-width=650}
---
### Highway
```{r}
ggplot(vehicles, aes(x = highway)) + geom_histogram(color = "white")
```
### City
```{r}
ggplot(vehicles, aes(x = city)) + geom_histogram(color = "white")
```
Column {data-width=350}
---
### Explanation
Correlation Exploration
===
Column {.tabset data-width=650}
---
### Highway
```{r, echo=FALSE}
highwaynumeric <- vehicles %>% select(-c(fueltype, aspiration, city))
m1 <- round(cor(highwaynumeric), 2)
corrplot(m1, method = c("number"),type="upper",main="Highway MPG",mar=c(0,0,1,0), number.cex = 0.5)
```
### City
```{r, echo=FALSE}
citynumeric <- vehicles %>% select(-c(fueltype, aspiration, highway))
m <- round(cor(citynumeric), 2)
corrplot(m, method = c("number"),type="upper",main="City MPG",mar=c(0,0,1,0), number.cex = 0.5)
```
Column {data-width=350}
---
### Explanation of Collinearity
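One way to read the correlation plots for collinearity is to list the predictor pairs whose correlation exceeds a cutoff. The sketch below is illustrative only: it uses the built-in `mtcars` data as a stand-in for `vehicles`, and the `threshold` of 0.8 is an assumed cutoff, not a value taken from this analysis.

```r
# mtcars ships with R and is used here only as a stand-in for the vehicles data
predictors <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
cor_matrix <- round(cor(predictors), 2)

# Look at each pair once by blanking the lower triangle and the diagonal
cor_upper <- cor_matrix
cor_upper[lower.tri(cor_upper, diag = TRUE)] <- NA

# Hypothetical cutoff; what counts as "high" is a judgment call
threshold <- 0.8
high <- which(abs(cor_upper) > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_upper)[high[, "row"]],
  var2 = colnames(cor_upper)[high[, "col"]],
  r    = cor_upper[high]
)
print(high_pairs)
```

Pairs flagged this way are candidates for removal or for a shrinkage method such as LASSO, which tends to keep only one of a group of strongly correlated predictors.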
LASSO
===
Column {data-width=300}
---
### Lambda Estimate for Highway Model
```{r, fig.align='center', echo=FALSE}
x<-as.matrix(vehicles[,3:15])
y1<-vehicles$highway
set.seed(2000)
proportion_split<-0.7
train<-sample(1:nrow(x), round(nrow(x)*proportion_split))
y1.train<-y1[train]
y1.test<-y1[-train]
x.train<-x[train,]
x.test<-x[-train,]
library(glmnet)
set.seed(2000)
cv.lasso1<-cv.glmnet(x.train, y1.train, alpha = 1)
#cv.lasso1$lambda.min
plot(cv.lasso1)
```
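A note on the predictor matrix: `as.matrix()` applied to a data frame that mixes numeric and factor columns returns a character matrix. One common alternative is `model.matrix()`, which dummy-codes the factors and keeps everything numeric. The sketch below uses a made-up four-row data frame (`toy`) with the same kinds of columns as `vehicles`; it is an illustration under that assumption, not a change to the models fitted here.

```r
# Hypothetical toy rows with the same kinds of columns as the vehicles data
toy <- data.frame(
  fueltype   = factor(c("gas", "gas", "diesel", "gas")),
  aspiration = factor(c("std", "turbo", "turbo", "std")),
  curbweight = c(2548, 3086, 3053, 2337),
  horsepower = c(111, 140, 160, 102)
)

# as.matrix() falls back to character when any column is non-numeric
x_char <- as.matrix(toy)

# model.matrix() dummy-codes the factors; drop the intercept column it adds
x_num <- model.matrix(~ ., data = toy)[, -1]

print(typeof(x_char))
print(colnames(x_num))
```

The intercept column is dropped because `glmnet()` fits its own intercept by default.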
### Reduced Highway MPG Model
```{r, fig.align='center', echo=FALSE}
model1<-glmnet(x.train, y1.train, alpha = 1, lambda = cv.lasso1$lambda.min)
coef1<-coef(model1)
#to compute training SSE from LASSO regression
y_predictedtrain1 <- predict(model1, s = cv.lasso1$lambda.min, newx = x.train)
SSEtrain1<-sum((y_predictedtrain1-y1.train)^2)
residuals1 <- y1.train - y_predictedtrain1 # residual = observed - fitted
#Computing R-squared
SSTOtrain1<-sum((y1.train-mean(y1.train))^2)
R2train1<-1-SSEtrain1/SSTOtrain1
#to compute testing SSE from LASSO regression
y_predictedtest1 <- predict(model1, s = cv.lasso1$lambda.min, newx = x.test)
SSEtest1<-sum((y_predictedtest1-y1.test)^2)
#Computing R-squared
SSTOtest1<-sum((y1.test-mean(y1.test))^2)
R2test1<-1-SSEtest1/SSTOtest1
print(coef1)
```
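The training and testing R-squared values are computed with the same arithmetic in several places. A small helper can make that calculation explicit. This is a sketch under stated assumptions: `r_squared` is a hypothetical helper name, and `mtcars` with `lm()` stands in for the `vehicles` data and the LASSO fit so that the example runs on its own.

```r
# Hypothetical helper: R-squared as 1 - SSE/SSTO, the same formula used above
r_squared <- function(actual, predicted) {
  sse  <- sum((actual - predicted)^2)
  ssto <- sum((actual - mean(actual))^2)
  1 - sse / ssto
}

# mtcars and lm() stand in for the vehicles data and the LASSO fit
set.seed(2000)
train_rows <- sample(seq_len(nrow(mtcars)), round(nrow(mtcars) * 0.7))
train_data <- mtcars[train_rows, ]
test_data  <- mtcars[-train_rows, ]

fit <- lm(mpg ~ wt + hp, data = train_data)

r2_train <- r_squared(train_data$mpg, predict(fit, newdata = train_data))
r2_test  <- r_squared(test_data$mpg,  predict(fit, newdata = test_data))
print(c(train = r2_train, test = r2_test))
```

A testing R-squared far below the training value would suggest overfitting; similar values suggest the model generalizes to held-out vehicles.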
Column {data-width=300}
---
### Estimated Lambda for City Model
```{r, fig.align='center', echo=FALSE}
library(MASS)
#bc<-boxcox(city~peakrpm+horsepower+compressionratio+curbweight+length, data = vehicles)
#lambda<-bc$x[which.max(bc$y)]
x<-as.matrix(vehicles[,3:15])
y<-log(vehicles$city)
set.seed(2000)
proportion_split<-0.7
train<-sample(1:nrow(x), round(nrow(x)*proportion_split))
y.train<-y[train]
y.test<-y[-train]
x.train<-x[train,]
x.test<-x[-train,]
library(glmnet)
set.seed(2000)
cv.lasso<-cv.glmnet(x.train, y.train, alpha = 1)
#cv.lasso$lambda.min
plot(cv.lasso)
```
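The commented-out lines in this chunk point to a Box-Cox search, while the model itself uses `log(city)`. As a rough sketch of how such a search works, the example below runs `MASS::boxcox()` on the built-in `mtcars` data, which is only a stand-in for `vehicles`; the formula and the name `lambda_hat` are assumptions made for illustration.

```r
library(MASS)

# mtcars stands in for the vehicles data; the response must be strictly positive
bc <- boxcox(mpg ~ wt + hp, data = mtcars, plotit = FALSE)

# bc$x holds the candidate lambdas and bc$y the profile log-likelihood at each one
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

A `lambda_hat` near 0 corresponds to a log transformation of the response, while a value near 1 suggests leaving the response untransformed.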
### Reduced City MPG Model
```{r, fig.align='center', echo=FALSE}
model<-glmnet(x.train, y.train, alpha = 1, lambda = cv.lasso$lambda.min)
coef<-coef(model)
#to compute training SSE from LASSO regression
y_predictedtrain <- predict(model, s = cv.lasso$lambda.min, newx = x.train)
SSEtrain<-sum((y_predictedtrain-y.train)^2)
residuals<-y.train - y_predictedtrain # residual = observed - fitted; fitted values are y_predictedtrain
#Computing R-squared
SSTOtrain<-sum((y.train-mean(y.train))^2)
R2train<-1-SSEtrain/SSTOtrain
#to compute testing SSE from LASSO regression
y_predictedtest <- predict(model, s = cv.lasso$lambda.min, newx = x.test)
SSEtest<-sum((y_predictedtest-y.test)^2)
#Computing R-squared
SSTOtest<-sum((y.test-mean(y.test))^2)
R2test<-1-SSEtest/SSTOtest
print(coef)
```
Column {data-width=400}
---
### Explanation
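Because the city model is fit to `log(city)`, its coefficients act multiplicatively on the MPG scale. The sketch below works through the curb-weight coefficient printed for the reduced city model. The 100 lb difference and the predicted value of 21 MPG are hypothetical numbers chosen for illustration, and since LASSO shrinks coefficients toward zero, the result is best read as approximate.

```r
# Curb-weight coefficient from the reduced city model, which was fit to log(city MPG)
beta_curbweight <- -2.747911e-04

# Hypothetical comparison: two otherwise identical cars that differ by 100 lbs
weight_difference <- 100

# Effects add on the log scale, so they multiply on the MPG scale
mpg_ratio <- exp(beta_curbweight * weight_difference)
percent_change <- (mpg_ratio - 1) * 100
print(round(percent_change, 2))

# Predictions from the log model return to MPG units through exp()
log_prediction <- log(21)  # hypothetical predicted value on the log scale
mpg_prediction <- exp(log_prediction)
print(mpg_prediction)
```

The highway model is fit on the original MPG scale, so its coefficients read directly as changes in MPG per unit of each predictor.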
Subset EDA
===
Column {.tabset data-width=500}
---
### City MPG
```{r, fig.align='center', echo=FALSE}
#cfit <- lm(city~peakrpm+horsepower+compressionratio+curbweight+length, data = vehicles)
#summary(cfit)
```
### Highway MPG
```{r, fig.align='center', echo=FALSE}
#hfit <- lm(highway~peakrpm+horsepower+compressionratio+curbweight+length+stroke+width, data = vehicles)
#summary(hfit)
```
Column {.tabset data-width=500}
---
### Peak RPM
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = peakrpm)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```
### Horsepower
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = horsepower)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```
### Compression Ratio
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = compressionratio)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```
### Curb Weight
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = curbweight)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```
### Length
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = length)) + geom_point(aes(y = city, color = "City MPG"), size = 1) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("City MPG" = "blue", "Highway MPG" = "red"))
```
### Stroke
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = stroke)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "red"))
```
### Width
```{r, fig.align='center', echo=FALSE}
ggplot(vehicles, aes(x = width)) + geom_point(aes(y = highway, color = "Highway MPG"), size = 1 ) + scale_color_manual(values = c("Highway MPG" = "red"))
```
LRM
===
Column {data-width=350}
---
### Explanation
```{r, fig.align='center', echo=FALSE}
```
Column {.tabset data-width=650}
---
### Linearity
```{r, fig.align='center', echo=FALSE, out.width="50%"}
plot(residuals1~y_predictedtrain1, xlab = "Fitted Values", ylab = "Residuals", main = "Highway MPG", col = "red")
abline(h=0)
plot(residuals~y_predictedtrain, xlab = "Fitted Values", ylab = "Residuals", main = "City MPG", col = "blue")
abline(h=0)
```
### Normality
```{r, fig.align='center', echo=FALSE, out.width="50%"}
library(nortest)
qqnorm(residuals1)
qqline(residuals1)
ad.test(residuals1)
qqnorm(residuals)
qqline(residuals)
ad.test(residuals)
```
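The Anderson-Darling p-values reported for the two models (0.04019 for highway and 2.214e-05 for city) are both below 0.05, which counts as evidence against normally distributed residuals at that level. The sketch below shows the same kind of check end to end, with two substitutions made so that it runs without extra packages: `mtcars` with `lm()` stands in for the `vehicles` data and the LASSO fits, and the Shapiro-Wilk test (`shapiro.test()`, in base R) is swapped in for `nortest::ad.test()`.

```r
# mtcars and lm() stand in for the vehicles data and the LASSO fits
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual = observed - fitted
fitted_values <- fitted(fit)
model_residuals <- mtcars$mpg - fitted_values

# Shapiro-Wilk is swapped in for Anderson-Darling because it ships with base R
normality_check <- shapiro.test(model_residuals)
print(normality_check$p.value)

# A small p-value (for example below 0.05) is evidence against normal residuals
reject_normality <- normality_check$p.value < 0.05
print(reject_normality)
```

Reading the test alongside the Q-Q plots helps, since a formal test can flag small departures from normality that matter little in practice.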